On finding cross-lingual article pairs
Author
Abstract
Finding a Wikipedia article in another language is often possible through the built-in interlanguage links. We explore the possibility of automatically generating these links for geotagged articles as an application of entity resolution at the article level. This has the potential to improve Wikipedia, and it also allows the use of a well-curated ground truth for the merging algorithm. The resolution is based only on the simple features of coordinates and title, metadata that can be retrieved from APIs without parsing the full article. We use a conflation approach to identify articles with mismatched coordinates and a translation matrix tailored to the titles. Even complicated cases, such as cities, municipalities, or departments with similar names at the same coordinates, can mostly be identified correctly. Honduras was chosen as a test region because the country has limited coverage (754 articles across the two languages at the time of writing [2]), which allows for a full manual assessment of the results, and because the resulting data is a basis for a geospatial search engine [1]. This finding has not been published in such brevity before, which is appropriate to the selection of features.
BODY
Cross-lingual merging of Honduran geotagged Wikipedia articles based on article names and locations alone results in 99.4% correct pairs.
REFERENCES
[1] D. Ahlers. Towards Geospatial Search for Honduras. In Proceedings of the Latinamerican Conference on Networked and Electronic Media (LACNEM 2011), San José, Costa Rica, 2011. Universidad Latina Costa Rica.
[2] D. Ahlers. "Of 754 Wikipedia articles geotagged in Honduras, 345 are from the Spanish version, 409 are in English." Tweet, 15 Jun 2012, 7:52pm. https://twitter.com/dirkahlers/statuses/213690505630990339.
Volume 1 of Tiny Transactions on Computer Science.
This content is released under the Creative Commons Attribution-NonCommercial ShareAlike License. Permission to make digital or hard copies of all or part of this work is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. CC BY-NC-SA 3.0: http://creativecommons.org/licenses/by-nc-sa/3.0/.
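The abstract names only two features, coordinates and title, and does not spell out the matching procedure. The following is a minimal sketch of how such a pairing could work, not the author's implementation. Everything in it is an assumption made for illustration: the distance threshold `max_km`, the similarity threshold `min_sim`, the small `ES_TO_EN` term map (a crude stand-in for the title translation matrix), the use of token-set Jaccard similarity, and the sample coordinates.

```python
import math
import unicodedata

# Hypothetical Spanish-to-English map for generic place-type words in titles.
# It stands in for the paper's translation matrix, whose contents are not given.
ES_TO_EN = {
    "departamento": "department",
    "municipio": "municipality",
    "lago": "lake",
    "de": "",  # drop the connective so "Lago de Yojoa" aligns with "Lake Yojoa"
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(a)))


def normalize(title, term_map=None):
    """Strip accents, lowercase, split on non-alphanumerics, map type words."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch)).lower()
    tokens = "".join(ch if ch.isalnum() else " " for ch in text).split()
    if term_map:
        tokens = [term_map.get(tok, tok) for tok in tokens]
    return frozenset(tok for tok in tokens if tok)


def jaccard(a, b):
    """Token-set overlap in [0, 1]; insensitive to word order."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_articles(es_articles, en_articles, max_km=15.0, min_sim=0.5):
    """Pair each Spanish article with its best nearby English candidate.

    Articles are (title, lat, lon) tuples. Both thresholds are illustrative.
    """
    pairs = []
    for es_title, es_lat, es_lon in es_articles:
        es_tokens = normalize(es_title, ES_TO_EN)
        best = None
        for en_title, en_lat, en_lon in en_articles:
            if haversine_km(es_lat, es_lon, en_lat, en_lon) > max_km:
                continue  # too far apart to be the same place
            sim = jaccard(es_tokens, normalize(en_title))
            if sim >= min_sim and (best is None or sim > best[1]):
                best = (en_title, sim)
        if best is not None:
            pairs.append((es_title, best[0]))
    return pairs


# Illustrative sample records; the coordinates are rough, not taken from the paper.
ES_SAMPLE = [
    ("Departamento de Copán", 14.85, -89.05),
    ("Copán Ruinas", 14.84, -89.15),
    ("Lago de Yojoa", 14.87, -87.98),
]
EN_SAMPLE = [
    ("Copán Department", 14.85, -89.04),
    ("Copán Ruinas", 14.84, -89.15),
    ("Lake Yojoa", 14.86, -87.98),
    ("Tegucigalpa", 14.10, -87.20),
]

if __name__ == "__main__":
    for es_title, en_title in match_articles(ES_SAMPLE, EN_SAMPLE):
        print(es_title, "<->", en_title)
```

In this sample the department and the town of Copán lie within the distance threshold of each other, so location alone cannot separate them. The mapped type word ("departamento" to "department") is what tips the title score, which mirrors the kind of same-coordinate ambiguity the abstract mentions.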
Similar resources
Cross-Lingual Infobox Alignment in Wikipedia Using Entity-Attribute Factor Graph
Wikipedia infoboxes contain information about article entities in the form of attribute-value pairs, and are thus a very rich source of structured knowledge. However, as the different language versions of Wikipedia evolve independently, it is a promising but challenging problem to find correspondences between infobox attributes in different language editions. In this paper, we propose 8 effecti...
Finding Translation Examples for Under-Resourced Language Pairs or for Narrow Domains; the Case for Machine Translation
The cyberspace is populated with valuable information sources, expressed in about 1500 different languages and dialects. Yet, for the vast majority of WEB surfers this wealth of information is practically inaccessible or meaningless. Recent advancements in cross-lingual information retrieval, multilingual summarization, cross-lingual question answering and machine translation promise to narrow ...
Cross-lingual Models of Word Embeddings: An Empirical Comparison
Despite interest in using cross-lingual knowledge to learn word embeddings for various tasks, a systematic comparison of the possible approaches is lacking in the literature. We perform an extensive evaluation of four popular approaches of inducing cross-lingual embeddings, each requiring a different form of supervision, on four typologically different language pairs. Our evaluation setup spans...
Limitations of Cross-Lingual Learning from Image Search
Cross-lingual representation learning is an important step in making NLP scale to all the world’s languages. Recent work on bilingual lexicon induction suggests that it is possible to learn cross-lingual representations of words based on similarities between images associated with these words. However, that work focused on the translation of selected nouns only. In our work, we investigate whet...
SemEval-2016 Task 1: Semantic Textual Similarity, Monolingual and Cross-Lingual Evaluation
Semantic Textual Similarity (STS) seeks to measure the degree of semantic equivalence between two snippets of text. Similarity is expressed on an ordinal scale that spans from semantic equivalence to complete unrelatedness. Intermediate values capture specifically defined levels of partial similarity. While prior evaluations constrained themselves to just monolingual snippets of text, the 2016 ...
Journal title:
- TinyToCS
Volume 1 Issue
Pages -
Publication date 2012